Every time I start a new data science project, I go through the same setup steps. Create folders, set up the virtual environment, add a .gitignore, write the Dockerfile. It takes time, and worse, when I don't follow a consistent structure I end up with projects that are hard to navigate six months later.
The structure below is what I use. It's not revolutionary, and it's opinionated in ways that work for me, but having a standard starting point saves time and keeps things consistent across projects. Here it is.
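Reconstructed from the pieces described below, the layout looks like this:

```
.
├── data/
├── Dockerfile
├── Makefile
├── .env
├── .envrc
├── .gitignore
├── notebooks/
├── README.md
├── scripts/
├── setup.py
└── thepkg/
    ├── interface/
    ├── ml_logic/
    │   ├── data.py
    │   ├── preprocessor.py
    │   └── model.py
    ├── params.py
    └── utils.py
```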
Here's what each piece does and why I include it:
data: The project's data files. Keep raw data here and don't overwrite it; treat it as read-only once ingested.
Dockerfile: Defines the container environment. I develop inside Docker containers (see my post on Docker + bind mounts), so this is a key part of every project, not optional scaffolding.
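As a sketch of what that Dockerfile might look like (the base image, Python version, and requirements.txt file are illustrative assumptions, not part of the original structure):

```dockerfile
# Illustrative Dockerfile sketch; python:3.11-slim and
# requirements.txt are assumptions, not prescriptions.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so the layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the project and install thepkg in editable mode
COPY . .
RUN pip install --no-cache-dir -e .
```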
Makefile: Automates common tasks such as building the image, running tests, and launching the notebook server. A well-written Makefile means you don't have to remember long docker run commands.
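A minimal sketch of such a Makefile; the image name, port, and Jupyter invocation are placeholders of mine, not from the original post:

```makefile
# Illustrative Makefile; IMAGE and the port are placeholders.
IMAGE := myproject

build:
	docker build -t $(IMAGE) .

test:
	docker run --rm -v $(PWD):/app $(IMAGE) pytest

notebook:
	docker run --rm -p 8888:8888 -v $(PWD):/app $(IMAGE) \
		jupyter lab --ip=0.0.0.0 --allow-root

.PHONY: build test notebook
```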
.env: Environment variables such as API keys, database connection strings, and anything else that shouldn't be hardcoded. Never commit this file.
.envrc: Works with direnv to automatically load the environment when you enter the project directory. Makes switching between projects less error-prone.
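A typical .envrc for this setup can be as short as the following, relying on direnv's built-in dotenv function:

```shell
# .envrc: run `direnv allow` once, then direnv loads this
# automatically whenever you cd into the project.
dotenv   # exports the variables defined in .env
```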
.gitignore: Keeps the repository clean. At minimum: .env, __pycache__, .ipynb_checkpoints, and whatever data files are too large to version.
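A starting .gitignore covering those minimums (the data/* rule assumes you version nothing under data/, which you may want to relax for small files):

```
.env
__pycache__/
.ipynb_checkpoints/
data/*
*.egg-info/
```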
notebooks: For exploration and deliverables. One notebook per logical question. Don't use notebooks as a substitute for proper code; anything reusable goes into thepkg.
README.md: How to set up and run the project. Assume the reader has never seen your code before, because in three months that reader will be you.
scripts: Standalone scripts for data ingestion, batch jobs, and one-off transformations. These call into thepkg rather than containing business logic directly.
setup.py: Lets you install thepkg as a local editable package with pip install -e . (the trailing dot means the current directory). Once it's installed, you can import it cleanly from notebooks and scripts without path hacks.
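A minimal setup.py that makes the editable install work might look like this; the version number is a placeholder:

```python
# Minimal setup.py sketch; version is illustrative.
from setuptools import find_packages, setup

setup(
    name="thepkg",
    version="0.1.0",
    packages=find_packages(),  # finds thepkg and its subpackages
)
```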
thepkg: The actual Python package where reusable code lives. Organized into:
interface: Entry points and API definitions.
ml_logic: Data loading (data.py), preprocessing (preprocessor.py), and model code (model.py). Keeping these separate makes it easier to swap out one component without touching the others.
params.py: Configuration constants such as model hyperparameters, file paths, and column names. Centralizing these means you change things in one place.
utils.py: Utility functions that donβt belong anywhere else.
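To make params.py concrete, here is an illustrative sketch; every name and value below is a hypothetical example, not taken from the original post:

```python
# thepkg/params.py (sketch): one central place for configuration.
from pathlib import Path

# Paths (hypothetical file names)
DATA_DIR = Path("data")
RAW_FILE = DATA_DIR / "raw.csv"

# Model settings (example values only)
TARGET_COLUMN = "target"
LEARNING_RATE = 1e-3
RANDOM_SEED = 42
```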
The folder structure is a template, not a contract. Some projects need more; some need less. But starting here is faster than starting from scratch, and it's easier to remove structure you don't need than to add it retroactively.
Below is a bash script to create the entire structure in one shot.
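A sketch of that script, following the layout described above; the __init__.py files are an addition of mine so that thepkg imports as a package:

```shell
#!/usr/bin/env bash
# Create the project skeleton described above.
set -euo pipefail

PROJECT=${1:-myproject}   # project name; "myproject" is a placeholder default

mkdir -p "$PROJECT"/{data,notebooks,scripts}
mkdir -p "$PROJECT"/thepkg/{interface,ml_logic}

touch "$PROJECT"/{Dockerfile,Makefile,.env,.envrc,.gitignore,README.md,setup.py}
touch "$PROJECT"/thepkg/{__init__.py,params.py,utils.py}
touch "$PROJECT"/thepkg/interface/__init__.py
touch "$PROJECT"/thepkg/ml_logic/{__init__.py,data.py,preprocessor.py,model.py}

echo "Created project skeleton in $PROJECT/"
```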